第一節：廣義線性模式-1(1)

這節課我們要帶領大家做兩個常見的多變項統計分析。

– 請至這裡下載本週的範例資料

dat = read.csv("Example data.csv", header = TRUE)
head(dat)

##       eGFR Disease Survival.time Death Diabetes Cancer      SBP      DBP
## 1 34.65379       1     0.4771037     0        0      1 121.2353 121.3079
## 2 37.21183       1     3.0704424     0        1      1 122.2000 122.6283
## 3 32.60074       1     0.2607117     1        0      0 118.9136 121.7621
## 4 29.68481       1            NA    NA        0      0 118.2212 112.7043
## 5 28.35726       0     0.1681673     1        0      0 116.7469 115.7705
## 6 33.95012       1     1.2238556     0        0      0 119.9936 116.3872
##   Education Income
## 1         2      0
## 2         2      0
## 3         0      0
## 4         1      0
## 5         0      0
## 6         1      0

我們先介紹GLM，他主要的功能在做多變項分析(多個X->1個Y)，其中他對於X的型態(可以接受類別變項&連續變項)並沒有太多的限制，也可以接受數種Y的型態。

– 最常見的Y分別為：

連續變項 - 假設誤差為常態分佈 (這就是線性迴歸)
二元變項 - 假設機率服從邏輯斯分布 (這就是邏輯斯迴歸)

第一節：廣義線性模式-1(2)

這是Y為連續變項的程式碼：

Formula1 = "eGFR ~ factor(Diabetes) + SBP + factor(Income)"
Result1 = glm(Formula1, family = "gaussian", data = dat)
Result1

## 
## Call:  glm(formula = Formula1, family = "gaussian", data = dat)
## 
## Coefficients:
##       (Intercept)  factor(Diabetes)1                SBP  
##         -44.68079           -0.03554            0.64674  
##   factor(Income)1    factor(Income)2  
##           0.32995            1.18211  
## 
## Degrees of Freedom: 896 Total (i.e. Null);  892 Residual
##   (103 observations deleted due to missingness)
## Null Deviance:       45610 
## Residual Deviance: 2669  AIC: 3536

Result2 = summary(Result1)
Result2

## 
## Call:
## glm(formula = Formula1, family = "gaussian", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4186  -1.1374  -0.0075   1.0652   6.0488  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -44.680787   0.705263 -63.353  < 2e-16 ***
## factor(Diabetes)1  -0.035537   0.162866  -0.218    0.827    
## SBP                 0.646741   0.005857 110.423  < 2e-16 ***
## factor(Income)1     0.329949   0.193766   1.703    0.089 .  
## factor(Income)2     1.182107   0.186438   6.340 3.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.992399)
## 
##     Null deviance: 45608.2  on 896  degrees of freedom
## Residual deviance:  2669.2  on 892  degrees of freedom
##   (103 observations deleted due to missingness)
## AIC: 3535.7
## 
## Number of Fisher Scoring iterations: 2

也可以這樣打指令… (要好好學，這種型態的程式碼應用比較廣泛)

dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
glm(dat[,"eGFR"] ~ ., family = "gaussian", data = dat[,c("Diabetes", "SBP", "Income")])

## 
## Call:  glm(formula = dat[, "eGFR"] ~ ., family = "gaussian", data = dat[, 
##     c("Diabetes", "SBP", "Income")])
## 
## Coefficients:
## (Intercept)    Diabetes1          SBP      Income1      Income2  
##   -44.68079     -0.03554      0.64674      0.32995      1.18211  
## 
## Degrees of Freedom: 896 Total (i.e. Null);  892 Residual
##   (103 observations deleted due to missingness)
## Null Deviance:       45610 
## Residual Deviance: 2669  AIC: 3536

第一節：廣義線性模式-1(3)

模式跑出來了，如果我們呼叫物件『Result2』會看到相當多資訊

– 我們挑幾個比較重要的來講…

Estimate: 這是斜率的估計值，一般來說我們檢定他是否等於0，若等於0則代表該因子不影響Y的變化。
Std. Error: 這是斜率的估計值的標準誤，95% ci是用他來計算的。
t value: 這是檢定Estimate是否顯著的檢定值，他等於Estimate/Std. Error。
Pr(>|t|): 這是t value相對應的p value。

– 以上這些數值可以利用函數「ls()」找找看在哪裡…

COEF = Result2$coefficients
COEF

##                       Estimate  Std. Error     t value     Pr(>|t|)
## (Intercept)       -44.68078691 0.705262678 -63.3533977 0.000000e+00
## factor(Diabetes)1  -0.03553692 0.162865536  -0.2181979 8.273248e-01
## SBP                 0.64674071 0.005856955 110.4226792 0.000000e+00
## factor(Income)1     0.32994860 0.193765940   1.7028204 8.895005e-02
## factor(Income)2     1.18210745 0.186438265   6.3404766 3.634238e-10

第一節：廣義線性模式-1(4)

有一些基本的運算我們需要學會，首先是利用t value計算p value(需要使用函數「pt()」)

my.pvalue = pt(abs(COEF[,3]), df = Result2$df[2], lower.tail = FALSE) * 2
all.equal(my.pvalue, COEF[,4])

## [1] TRUE

這邊有點難懂，了解這串程式碼需要一些統計背景。簡單來說，在R裡面我們可以簡單的利用下列函數查詢幾個常見分布的資訊：

qnorm(0.975)

## [1] 1.959964

pnorm(1.96)

## [1] 0.9750021

qchisq(0.975, df = 1)

## [1] 5.023886

pchisq(3.84, df = 1)

## [1] 0.9499565

第一節：廣義線性模式-1(5)

接著，我們要計算95% ci…

m = COEF[,1]
s = COEF[,2]

#計算下界
ci.low = m - qt(0.975, df = Result2$df[2]) * s
ci.low

##       (Intercept) factor(Diabetes)1               SBP   factor(Income)1 
##      -46.06495450       -0.35518122        0.63524569       -0.05034167 
##   factor(Income)2 
##        0.81619867

#計算上界
ci.up = m + qt(0.975, df = Result2$df[2]) * s
ci.up

##       (Intercept) factor(Diabetes)1               SBP   factor(Income)1 
##       -43.2966193         0.2841074         0.6582357         0.7102389 
##   factor(Income)2 
##         1.5480162

還記得函數「paste0()」以及函數「formatC()」嗎?我們可以把它們貼起來：

m.3 = formatC(m, 3, format = "f")
ci.low.3 = formatC(ci.low, 3, format = "f")
ci.up.3 = formatC(ci.up, 3, format = "f")
paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")")

## [1] "-44.681 (-46.065 to -43.297)" "-0.036 (-0.355 to 0.284)"    
## [3] "0.647 (0.635 to 0.658)"       "0.330 (-0.050 to 0.710)"     
## [5] "1.182 (0.816 to 1.548)"

第一節：廣義線性模式-1(6)

這是Y為二元變項的程式碼：

Result3 = glm(dat[,"Disease"] ~ ., family = "binomial", data = dat[,c("Diabetes", "SBP", "Income")])
Result3

## 
## Call:  glm(formula = dat[, "Disease"] ~ ., family = "binomial", data = dat[, 
##     c("Diabetes", "SBP", "Income")])
## 
## Coefficients:
## (Intercept)    Diabetes1          SBP      Income1      Income2  
##   -0.623638     0.353512     0.013074    -0.007878     0.057746  
## 
## Degrees of Freedom: 876 Total (i.e. Null);  872 Residual
##   (123 observations deleted due to missingness)
## Null Deviance:       1013 
## Residual Deviance: 1005  AIC: 1015

Result4 = summary(Result3)
Result4

## 
## Call:
## glm(formula = dat[, "Disease"] ~ ., family = "binomial", data = dat[, 
##     c("Diabetes", "SBP", "Income")])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9365  -1.5102   0.7647   0.8141   0.9526  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.623638   0.936715  -0.666   0.5056  
## Diabetes1    0.353512   0.232143   1.523   0.1278  
## SBP          0.013074   0.007823   1.671   0.0947 .
## Income1     -0.007878   0.259658  -0.030   0.9758  
## Income2      0.057746   0.257617   0.224   0.8226  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1013.4  on 876  degrees of freedom
## Residual deviance: 1005.0  on 872  degrees of freedom
##   (123 observations deleted due to missingness)
## AIC: 1015
## 
## Number of Fisher Scoring iterations: 4

第一節：廣義線性模式-1(7)

這就是邏輯斯迴歸，與上面的線性迴歸差別在2點：

線性迴歸是使用t value進行檢定，而邏輯斯迴歸是使用z value進行檢定
邏輯斯迴歸比較不常使用『斜率』以及其95% CI作結果的呈現，而較常呈現『勝算比(OR, Odds ratio)』以及其95% CI

因此，我們剛剛學的兩串程式碼要做一定程度的改變

COEF = Result4$coefficients

#計算p value(使用函數「pnorm()」)
my.pvalue = pnorm(abs(COEF[,3]), lower.tail = FALSE) * 2
all.equal(my.pvalue, COEF[,4])

## [1] TRUE

#計算95% CI
m = COEF[,1]
s = COEF[,2]
ci.low = m - qnorm(0.975) * s
ci.up = m + qnorm(0.975) * s

#轉變為OR (這是唯一增加的部分)
m.OR = exp(m)
ci.low.OR = exp(ci.low)
ci.up.OR = exp(ci.up)

#標準化輸出
m.OR.3 = formatC(m.OR, 3, format = "f")
ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")")

## [1] "0.536 (0.085 to 3.361)" "1.424 (0.903 to 2.245)"
## [3] "1.013 (0.998 to 1.029)" "0.992 (0.596 to 1.650)"
## [5] "1.059 (0.639 to 1.755)"

練習1：展示Paper上的資訊

在R裡面進行廣義線性模式時，不會呈現出95%信賴區間。因此，練習題是寫一個函數，擁有下列功能：

使用者能隨意輸入1個依變項 & 數個自變項進行運算

2_1. 使用者輸入的依變項若為連續變項，則進行線性回歸運算

2_2. 使用者輸入的依變項若為類別變項，則先確認是否僅有0與1兩個類別，確認成功才進行運算

運算結束後，計算點估計及95% CI
將運算出來的95% CI擴充在原來的係數矩陣右邊，呈現類似這樣的狀況：

##             Estimate Std. Error z value  Pr(>|z|) OR (95% CI)             
## (Intercept) "-0.624" "0.937"    "-0.666" "0.506"  "0.536 (0.085 to 3.361)"
## Diabetes1   "0.354"  "0.232"    "1.523"  "0.128"  "1.424 (0.903 to 2.245)"
## SBP         "0.013"  "0.008"    "1.671"  "0.095"  "1.013 (0.998 to 1.029)"
## Income1     "-0.008" "0.260"    "-0.030" "0.976"  "0.992 (0.596 to 1.650)"
## Income2     "0.058"  "0.258"    "0.224"  "0.823"  "1.059 (0.639 to 1.755)"

##                   Estimate  Std. Error t value   Pr(>|t|)
## (Intercept)       "-44.681" "0.705"    "-63.353" "<0.001"
## factor(Diabetes)1 "-0.036"  "0.163"    "-0.218"  "0.827" 
## SBP               "0.647"   "0.006"    "110.423" "<0.001"
## factor(Income)1   "0.330"   "0.194"    "1.703"   "0.089" 
## factor(Income)2   "1.182"   "0.186"    "6.340"   "<0.001"
##                   Slope (95% CI)                
## (Intercept)       "-44.681 (-46.065 to -43.297)"
## factor(Diabetes)1 "-0.036 (-0.355 to 0.284)"    
## SBP               "0.647 (0.635 to 0.658)"      
## factor(Income)1   "0.330 (-0.050 to 0.710)"     
## factor(Income)2   "1.182 (0.816 to 1.548)"

練習1答案(1)

這題相對簡單，但思路仍然要很清晰：

my_func.glm = function (y, x_data) {
  
  lvl.y = levels(factor(y))
  n.lvl.y = length(lvl.y)  
  
  if (n.lvl.y < 10) {
    
    if (mean(lvl.y %in% 0:1) == 1) {
      
      Result = glm(y ~ ., family = "binomial", data = x_data)
      summary_Result = summary(Result)
      COEF = summary_Result$coefficients
      
      m = COEF[,1]
      s = COEF[,2]
      
      #計算下界
      ci.low = m - qnorm(0.975) * s
      
      #計算上界
      ci.up = m + qnorm(0.975) * s
      
      #轉變為OR (這是唯一增加的部分)
      m.OR = exp(m)
      ci.low.OR = exp(ci.low)
      ci.up.OR = exp(ci.up)
      
      #標準化輸出
      m.OR.3 = formatC(m.OR, 3, format = "f")
      ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
      ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
      
      COEF = cbind(COEF, paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")"))
      
      colnames(COEF)[5] = "OR (95% CI)"
      
      COEF
      
    } else {
      
      cat("y為類別變項時僅能接受0與1的類別變項\n")
      
    }
    
  } else {
    
    Result = glm(y ~ ., family = "gaussian", data = x_data)
    summary_Result = summary(Result)
    COEF = summary_Result$coefficients

    m = COEF[,1]
    s = COEF[,2]
    
    #計算下界
    ci.low = m - qt(0.975, df = summary_Result$df[2]) * s
    
    #計算上界
    ci.up = m + qt(0.975, df = summary_Result$df[2]) * s
    
    m.3 = formatC(m, 3, format = "f")
    ci.low.3 = formatC(ci.low, 3, format = "f")
    ci.up.3 = formatC(ci.up, 3, format = "f")
    COEF = cbind(COEF, paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")"))
    
    colnames(COEF)[5] = "Slope (95% CI)"
    
    COEF
    
  }
  
}

練習1答案(2)

試用一下結果：

dat[,"Income"] = as.factor(dat[,"Income"])

my_func.glm(y = dat$Education, x_data = dat[,c("SBP", "Income")])

## y為類別變項時僅能接受0與1的類別變項

my_func.glm(y = dat$Disease, x_data = dat[,c("SBP", "Income")])

##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.87057456391603"   
## SBP         "0.018343099853456"    "0.00717946299007011"
## Income1     "-0.00331543024980331" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## SBP         "2.5549403735107"     "0.0106205989530202"
## Income1     "-0.0130361829036361" "0.989598925526899" 
## Income2     "0.431975941993863"   "0.665758898626829" 
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## SBP         "1.019 (1.004 to 1.033)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"

my_func.glm(y = dat$eGFR, x_data = dat[,c("SBP", "Income")])

##             Estimate            Std. Error            t value            
## (Intercept) "-44.5446073390111" "0.650718644728415"   "-68.4544813643726"
## SBP         "0.645468680757481" "0.00532053317500706" "121.316540941712" 
## Income1     "0.320945790901737" "0.189850880461012"   "1.69051515653964" 
## Income2     "1.18312335438857"  "0.183056090544938"   "6.46317394229569" 
##             Pr(>|t|)               Slope (95% CI)                
## (Intercept) "0"                    "-44.545 (-45.822 to -43.268)"
## SBP         "0"                    "0.645 (0.635 to 0.656)"      
## Income1     "0.0912703998742093"   "0.321 (-0.052 to 0.694)"     
## Income2     "1.66581914375454e-10" "1.183 (0.824 to 1.542)"

第二節：廣義線性模式-2(1)

現在我們還遇到一個問題，那就是像剛剛的結果中，Income有兩個類別各有一個p value，且一個顯著一個不顯著，這樣Income整體來說應該跟eGFR有關還無關呢??

##                   Estimate  Std. Error t value   Pr(>|t|)
## (Intercept)       "-44.681" "0.705"    "-63.353" "<0.001"
## factor(Diabetes)1 "-0.036"  "0.163"    "-0.218"  "0.827" 
## SBP               "0.647"   "0.006"    "110.423" "<0.001"
## factor(Income)1   "0.330"   "0.194"    "1.703"   "0.089" 
## factor(Income)2   "1.182"   "0.186"    "6.340"   "<0.001"
##                   Slope (95% CI)                
## (Intercept)       "-44.681 (-46.065 to -43.297)"
## factor(Diabetes)1 "-0.036 (-0.355 to 0.284)"    
## SBP               "0.647 (0.635 to 0.658)"      
## factor(Income)1   "0.330 (-0.050 to 0.710)"     
## factor(Income)2   "1.182 (0.816 to 1.548)"

我們可以使用Wald test來整合這兩個p value，計算過程需要矩陣運算，所以有一定程度的困難，同學把下列計算過程抄起來即可：

dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
Result1 = glm(dat[,"eGFR"] ~ ., family = "gaussian", data = dat[,c("Diabetes", "SBP", "Income")])
Result2 = summary(Result1)

COEF = Result2$coefficients
COEF

##                 Estimate  Std. Error     t value     Pr(>|t|)
## (Intercept) -44.68078691 0.705262678 -63.3533977 0.000000e+00
## Diabetes1    -0.03553692 0.162865536  -0.2181979 8.273248e-01
## SBP           0.64674071 0.005856955 110.4226792 0.000000e+00
## Income1       0.32994860 0.193765940   1.7028204 8.895005e-02
## Income2       1.18210745 0.186438265   6.3404766 3.634238e-10

VCOV = Result2$cov.scaled
VCOV

##              (Intercept)     Diabetes1           SBP      Income1
## (Intercept)  0.497395445  0.0402153065 -4.110204e-03 -0.008966506
## Diabetes1    0.040215306  0.0265251829 -3.688336e-04 -0.001223519
## SBP         -0.004110204 -0.0003688336  3.430393e-05  0.000040765
## Income1     -0.008966506 -0.0012235192  4.076500e-05  0.037545239
## Income2     -0.005350712 -0.0002702795  9.676645e-06  0.004237876
##                   Income2
## (Intercept) -5.350712e-03
## Diabetes1   -2.702795e-04
## SBP          9.676645e-06
## Income1      4.237876e-03
## Income2      3.475923e-02

#將第4項與第5項進行整合
variables = 4:5
DF = length(variables)
CHISQ = t(COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% COEF[variables,1]
PVALUE = pchisq(CHISQ, df = DF, lower.tail = FALSE)
PVALUE

##              [,1]
## [1,] 1.169025e-09

#僅檢定第4項 (結果近似於使用t value進行檢定)
variables = 4
DF = length(variables)
CHISQ = t(COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% COEF[variables,1]
PVALUE = pchisq(CHISQ, df = DF, lower.tail = FALSE)
PVALUE

##            [,1]
## [1,] 0.08860168

COEF[4,4]

## [1] 0.08895005

練習2：Wald test

現在我們想要列出兩個summary表，讓使用者進行廣義線性模式時，除了像練習-1的部分匯出95% ci外，還要匯出每個variable使用wald test的整體檢定結果(請把他整合至1個list內)：

## $SUMMARY
##          Global p value
## Diabetes "0.827"       
## SBP      "<0.001"      
## Income   "<0.001"      
## 
## $COEF
##             Estimate  Std. Error t value   Pr(>|t|)
## (Intercept) "-44.681" "0.705"    "-63.353" "<0.001"
## Diabetes1   "-0.036"  "0.163"    "-0.218"  "0.827" 
## SBP         "0.647"   "0.006"    "110.423" "<0.001"
## Income1     "0.330"   "0.194"    "1.703"   "0.089" 
## Income2     "1.182"   "0.186"    "6.340"   "<0.001"
##             Slope (95% CI)                
## (Intercept) "-44.681 (-46.065 to -43.297)"
## Diabetes1   "-0.036 (-0.355 to 0.284)"    
## SBP         "0.647 (0.635 to 0.658)"      
## Income1     "0.330 (-0.050 to 0.710)"     
## Income2     "1.182 (0.816 to 1.548)"

特別要注意的是，使用者輸入自變項時，順序若有所更改，那需要合併的項次會因此不同，你能解決這個問題嗎?

練習2答案(1)

這題的重點是要把變數資訊存下來：

my_func.glm = function (y, x_data) {
  
  lvl.y = levels(factor(y))
  n.lvl.y = length(lvl.y)  
  
  if (n.lvl.y < 10) {
    
    if (mean(lvl.y %in% 0:1) == 1) {
      
      Result = glm(y ~ ., family = "binomial", data = x_data)
      summary_Result = summary(Result)
      COEF = summary_Result$coefficients
      
      m = COEF[,1]
      s = COEF[,2]
      
      #計算下界
      ci.low = m - qnorm(0.975) * s
      
      #計算上界
      ci.up = m + qnorm(0.975) * s
      
      #轉變為OR (這是唯一增加的部分)
      m.OR = exp(m)
      ci.low.OR = exp(ci.low)
      ci.up.OR = exp(ci.up)
      
      #標準化輸出
      m.OR.3 = formatC(m.OR, 3, format = "f")
      ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
      ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
      
      COEF = cbind(COEF, paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")"))
      
      colnames(COEF)[5] = "OR (95% CI)"
      
    } else {
      
      cat("y為類別變項時僅能接受0與1的類別變項\n")
      
    }
    
  } else {
    
    Result = glm(y ~ ., family = "gaussian", data = x_data)
    summary_Result = summary(Result)
    COEF = summary_Result$coefficients

    m = COEF[,1]
    s = COEF[,2]
    
    #計算下界
    ci.low = m - qt(0.975, df = summary_Result$df[2]) * s
    
    #計算上界
    ci.up = m + qt(0.975, df = summary_Result$df[2]) * s
    
    m.3 = formatC(m, 3, format = "f")
    ci.low.3 = formatC(ci.low, 3, format = "f")
    ci.up.3 = formatC(ci.up, 3, format = "f")
    COEF = cbind(COEF, paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")"))
    
    colnames(COEF)[5] = "Slope (95% CI)"
    
  }
  
  summary_result = matrix(0, ncol = 1, nrow = ncol(x_data))
  rownames(summary_result) = colnames(x_data)
  rownames(summary_result) = colnames(x_data)
  
  num_COEF = summary_Result$coefficients
  VCOV = summary_Result$cov.scaled
  start_val = 2
  
  for (i in 1:ncol(x_data)) {
    if (class(x_data[,i]) == 'factor') {
      end_val = start_val + length(levels(x_data[,i])) - 2
    } else {
      end_val = start_val 
    }
    variables = start_val:end_val
    DF = length(variables)
    CHISQ = t(num_COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% num_COEF[variables,1]
    summary_result[i,1] = pchisq(CHISQ, df = DF, lower.tail = FALSE)
    start_val = end_val + 1
  }
  
  MY_LIST = list(summary_result, COEF)
  MY_LIST
  
}

練習2答案(2)

試用一下結果：

dat[,"Income"] = as.factor(dat[,"Income"])

my_func.glm(y = dat$Disease, x_data = dat[,c("SBP", "Income")])

## [[1]]
##             [,1]
## SBP    0.0106206
## Income 0.9092092
## 
## [[2]]
##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.87057456391603"   
## SBP         "0.018343099853456"    "0.00717946299007011"
## Income1     "-0.00331543024980331" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## SBP         "2.5549403735107"     "0.0106205989530202"
## Income1     "-0.0130361829036361" "0.989598925526899" 
## Income2     "0.431975941993863"   "0.665758898626829" 
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## SBP         "1.019 (1.004 to 1.033)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"

my_func.glm(y = dat$Disease, x_data = dat[,c("Income", "SBP")])

## [[1]]
##             [,1]
## Income 0.9092092
## SBP    0.0106206
## 
## [[2]]
##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.870574563916031"  
## Income1     "-0.00331543024980338" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
## SBP         "0.018343099853456"    "0.00717946299007011"
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## Income1     "-0.0130361829036364" "0.989598925526899" 
## Income2     "0.431975941993862"   "0.66575889862683"  
## SBP         "2.5549403735107"     "0.0106205989530202"
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"
## SBP         "1.019 (1.004 to 1.033)"

第三節：Cox比例風險模型(1)

上週的作業請大家預先預習Cox比例風險模型所需使用的套件及相關程式碼，不知道大家有沒有認真預習?

– 目前套件『survival』是最常用來做存活分析的套件，他同時支援各式存活分析的相關功能。

或是使用函數『install.packages()』安裝指定套件

install.packages("survival")

別忘記在安裝完成後我們必須使用函數『library()』載入這個套件的功能。

library("survival")

套件說明書如下

第三節：Cox比例風險模型(2)

我們要使用函數「coxph()」來進行分析，分析的語法與上面的廣義線性模式非常類似，只差在依變項因為要包含時間資訊的關係，因此比較特別一點：

dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
Result5 = coxph(Surv(dat[,"Survival.time"], dat[,"Death"]) ~ ., data = dat[,c("Diabetes", "SBP", "Income")])
Result5

## Call:
## coxph(formula = Surv(dat[, "Survival.time"], dat[, "Death"]) ~ 
##     ., data = dat[, c("Diabetes", "SBP", "Income")])
## 
##                coef exp(coef)  se(coef)      z     p
## Diabetes1 -0.013309  0.986779  0.128309 -0.104 0.917
## SBP       -0.006637  0.993385  0.004487 -1.479 0.139
## Income1   -0.255620  0.774436  0.164798 -1.551 0.121
## Income2    0.107833  1.113862  0.147528  0.731 0.465
## 
## Likelihood ratio test=5.95  on 4 df, p=0.203
## n= 883, number of events= 448 
##    (117 observations deleted due to missingness)

Result6 = summary(Result5)
Result6

## Call:
## coxph(formula = Surv(dat[, "Survival.time"], dat[, "Death"]) ~ 
##     ., data = dat[, c("Diabetes", "SBP", "Income")])
## 
##   n= 883, number of events= 448 
##    (117 observations deleted due to missingness)
## 
##                coef exp(coef)  se(coef)      z Pr(>|z|)
## Diabetes1 -0.013309  0.986779  0.128309 -0.104    0.917
## SBP       -0.006637  0.993385  0.004487 -1.479    0.139
## Income1   -0.255620  0.774436  0.164798 -1.551    0.121
## Income2    0.107833  1.113862  0.147528  0.731    0.465
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## Diabetes1    0.9868     1.0134    0.7674     1.269
## SBP          0.9934     1.0067    0.9847     1.002
## Income1      0.7744     1.2913    0.5607     1.070
## Income2      1.1139     0.8978    0.8342     1.487
## 
## Concordance= 0.535  (se = 0.016 )
## Rsquare= 0.007   (max possible= 0.997 )
## Likelihood ratio test= 5.95  on 4 df,   p=0.2
## Wald test            = 5.79  on 4 df,   p=0.2
## Score (logrank) test = 5.81  on 4 df,   p=0.2

第三節：Cox比例風險模型(3)

在解釋Cox比例風險模型的結果時，與邏輯斯回歸相當類似。同樣的，Cox比例風險模型的結果也很少使用『斜率』以及其95% CI作結果的呈現，而較常呈現『危害比(HR, Hazard ratio)』以及其95% CI

– 計算HR的方法與OR類似…

COEF = Result6$coefficients

#計算p value(使用函數「pnorm()」)
my.pvalue = pnorm(abs(COEF[,4]), lower.tail = FALSE) * 2
all.equal(my.pvalue, COEF[,5])

## [1] TRUE

#計算95% CI
m = COEF[,1]
s = COEF[,3]
ci.low = m - qnorm(0.975) * s
ci.up = m + qnorm(0.975) * s

#轉變為HR (與OR完全一樣)
m.HR = exp(m)
ci.low.HR = exp(ci.low)
ci.up.HR = exp(ci.up)

#標準化輸出
m.HR.3 = formatC(m.HR, 3, format = "f")
ci.low.HR.3 = formatC(ci.low.HR, 3, format = "f")
ci.up.HR.3 = formatC(ci.up.HR, 3, format = "f")
paste0(m.HR.3, " (", ci.low.HR.3, " to ", ci.up.HR.3, ")")

## [1] "0.987 (0.767 to 1.269)" "0.993 (0.985 to 1.002)"
## [3] "0.774 (0.561 to 1.070)" "1.114 (0.834 to 1.487)"

第三節：Cox比例風險模型(4)

同樣的，Wald test也可以用在Cox比例風險模型上面：

COEF = Result6$coefficients
COEF

##                   coef exp(coef)   se(coef)          z  Pr(>|z|)
## Diabetes1 -0.013308764 0.9867794 0.12830893 -0.1037244 0.9173881
## SBP       -0.006637461 0.9933845 0.00448735 -1.4791493 0.1391004
## Income1   -0.255619728 0.7744364 0.16479768 -1.5511124 0.1208747
## Income2    0.107832965 1.1138617 0.14752797  0.7309323 0.4648205

VCOV = Result5$var
VCOV

##               [,1]          [,2]          [,3]          [,4]
## [1,]  0.0164631803 -1.936315e-04 -8.050264e-04  4.645830e-04
## [2,] -0.0001936315  2.013631e-05  9.961013e-06 -2.069759e-05
## [3,] -0.0008050264  9.961013e-06  2.715827e-02  2.771585e-03
## [4,]  0.0004645830 -2.069759e-05  2.771585e-03  2.176450e-02

#將第3項與第4項進行整合
variables = 3:4
DF = length(variables)
CHISQ = t(COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% COEF[variables,1]
PVALUE = pchisq(CHISQ, df = DF, lower.tail = FALSE)
PVALUE

##           [,1]
## [1,] 0.1978171

– 注意：Cox比例風險模型不存在截距項，這點要特別注意喔！

練習3：擴充至存活分析

請擴充練習2函數的最終功能，將讓使用者能選擇使用線性迴歸、邏輯斯迴歸或Cox比例風險模型，並匯出類似於下面的list：

## $SUMMARY
##          Global p value
## Diabetes "0.917"       
## SBP      "0.139"       
## Income   "0.198"       
## 
## $COEF
##           coef     se(coef) z        Pr(>|z|) HR (95% CI)             
## Diabetes1 "-0.013" "0.128"  "-0.104" "0.917"  "0.987 (0.767 to 1.269)"
## SBP       "-0.007" "0.004"  "-1.479" "0.139"  "0.993 (0.985 to 1.002)"
## Income1   "-0.256" "0.165"  "-1.551" "0.121"  "0.774 (0.561 to 1.070)"
## Income2   "0.108"  "0.148"  "0.731"  "0.465"  "1.114 (0.834 to 1.487)"

練習3答案(1)

這題的重點是要條件判斷式寫完整：

library(survival)

my_func = function (y, time = NULL, x_data) {
  
  lvl.y = levels(factor(y))
  n.lvl.y = length(lvl.y)  
  
  if (is.null(time)) {
    
    if (n.lvl.y < 10) {
      
      if (mean(lvl.y %in% 0:1) == 1) {
        
        Result = glm(y ~ ., family = "binomial", data = x_data)
        summary_Result = summary(Result)
        COEF = summary_Result$coefficients
        
        m = COEF[,1]
        s = COEF[,2]
        
        #計算下界
        ci.low = m - qnorm(0.975) * s
        
        #計算上界
        ci.up = m + qnorm(0.975) * s
        
        #轉變為OR (這是唯一增加的部分)
        m.OR = exp(m)
        ci.low.OR = exp(ci.low)
        ci.up.OR = exp(ci.up)
        
        #標準化輸出
        m.OR.3 = formatC(m.OR, 3, format = "f")
        ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
        ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
        
        COEF = cbind(COEF, paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")"))
        
        colnames(COEF)[5] = "OR (95% CI)"
        
      } else {
        
        cat("y為類別變項時僅能接受0與1的類別變項\n")
        
      }
      
    } else {
      
      Result = glm(y ~ ., family = "gaussian", data = x_data)
      summary_Result = summary(Result)
      COEF = summary_Result$coefficients
      
      m = COEF[,1]
      s = COEF[,2]
      
      #計算下界
      ci.low = m - qt(0.975, df = summary_Result$df[2]) * s
      
      #計算上界
      ci.up = m + qt(0.975, df = summary_Result$df[2]) * s
      
      m.3 = formatC(m, 3, format = "f")
      ci.low.3 = formatC(ci.low, 3, format = "f")
      ci.up.3 = formatC(ci.up, 3, format = "f")
      COEF = cbind(COEF, paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")"))
      
      colnames(COEF)[5] = "Slope (95% CI)"
      
    }
    
  } else {
    
    Result = coxph(Surv(time, y) ~ ., data = x_data)
    summary_Result = summary(Result)
    
    COEF = summary_Result$coefficients
    
    m = COEF[,1]
    s = COEF[,3]
    
    #計算下界
    ci.low = m - qnorm(0.975) * s
    
    #計算上界
    ci.up = m + qnorm(0.975) * s
    
    #轉變為HR (這是唯一增加的部分)
    m.HR = exp(m)
    ci.low.HR = exp(ci.low)
    ci.up.HR = exp(ci.up)
    
    #標準化輸出
    m.HR.3 = formatC(m.HR, 3, format = "f")
    ci.low.HR.3 = formatC(ci.low.HR, 3, format = "f")
    ci.up.HR.3 = formatC(ci.up.HR, 3, format = "f")
    
    COEF = cbind(COEF, paste0(m.HR.3, " (", ci.low.HR.3, " to ", ci.up.HR.3, ")"))
    
    colnames(COEF)[6] = "HR (95% CI)"
    
  }
  
  summary_result = matrix(0, ncol = 1, nrow = ncol(x_data))
  rownames(summary_result) = colnames(x_data)
  rownames(summary_result) = colnames(x_data)
  
  num_COEF = summary_Result$coefficients
  if (is.null(time)) {
    VCOV = summary_Result$cov.scaled
    start_val = 2
  } else {
    VCOV = Result$var
    start_val = 1
  }
  
  for (i in 1:ncol(x_data)) {
    if (class(x_data[,i]) == 'factor') {
      end_val = start_val + length(levels(x_data[,i])) - 2
    } else {
      end_val = start_val 
    }
    variables = start_val:end_val
    DF = length(variables)
    CHISQ = t(num_COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% num_COEF[variables,1]
    summary_result[i,1] = pchisq(CHISQ, df = DF, lower.tail = FALSE)
    start_val = end_val + 1
  }
  
  MY_LIST = list(summary_result, COEF)
  MY_LIST
  
}

練習3答案(2)

試用一下結果：

dat[,"Income"] = as.factor(dat[,"Income"])

my_func.glm(y = dat$Disease, x_data = dat[,c("SBP", "Income")])

## [[1]]
##             [,1]
## SBP    0.0106206
## Income 0.9092092
## 
## [[2]]
##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.87057456391603"   
## SBP         "0.018343099853456"    "0.00717946299007011"
## Income1     "-0.00331543024980331" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## SBP         "2.5549403735107"     "0.0106205989530202"
## Income1     "-0.0130361829036361" "0.989598925526899" 
## Income2     "0.431975941993863"   "0.665758898626829" 
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## SBP         "1.019 (1.004 to 1.033)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"

my_func(y = dat$Death, time = dat$Survival.time, x_data = dat[,c("Income", "SBP")])

## [[1]]
##             [,1]
## Income 0.2242644
## SBP    0.1165677
## 
## [[2]]
##         coef                   exp(coef)           se(coef)             
## Income1 "-0.213973925878094"   "0.807369436197337" "0.160557604810633"  
## Income2 "0.135928247771198"    "1.14559969131463"  "0.14487635651984"   
## SBP     "-0.00656382892311547" "0.993457665946688" "0.00418252926047729"
##         z                   Pr(>|z|)            HR (95% CI)             
## Income1 "-1.33269256308639" "0.182632714802121" "0.807 (0.589 to 1.106)"
## Income2 "0.938236238378785" "0.348123019048662" "1.146 (0.862 to 1.522)"
## SBP     "-1.569344412038"   "0.116567709083106" "0.993 (0.985 to 1.002)"

my_func(y = dat$Death, time = dat$Survival.time, x_data = dat[,c("SBP", "Income")])

## [[1]]
##             [,1]
## SBP    0.1165677
## Income 0.2242644
## 
## [[2]]
##         coef                   exp(coef)           se(coef)             
## SBP     "-0.00656382892311547" "0.993457665946688" "0.00418252926047729"
## Income1 "-0.213973925878094"   "0.807369436197337" "0.160557604810633"  
## Income2 "0.135928247771198"    "1.14559969131463"  "0.14487635651984"   
##         z                   Pr(>|z|)            HR (95% CI)             
## SBP     "-1.569344412038"   "0.116567709083106" "0.993 (0.985 to 1.002)"
## Income1 "-1.33269256308639" "0.182632714802121" "0.807 (0.589 to 1.106)"
## Income2 "0.938236238378785" "0.348123019048662" "1.146 (0.862 to 1.522)"

家庭/期中作業

現在你已經學會了如何在R內進行線性迴歸、邏輯斯迴歸或Cox比例風險模型。
在醫學期刊中，若使用這些統計方法，通常會一次看數個調整前/調整後係數的差異。
請你做一個函數，讓使用者輸入以下資訊，並獲得一張Paper上常看到的表格：

選擇線性迴歸、邏輯斯迴歸或Cox比例風險模型
輸入正確的依變項
輸入有哪些變項想要進行檢定(自變項集合)
輸入需要共同調整的變項(調整變項集合)

– 注意，假設SBP這個變項同時出現在自變項集合中及調整變項集合中，則在調整時不需要再調整它一次

以下範例為共同調整了Diabetes以及SBP

##    Independent variable Crude-HR (95% CI)     p-value 
## 1  "Diabetes"           ""                    "0.321" 
## 2  "Diabetes:0"         "1.00"                ""      
## 3  "Diabetes:1"         "0.89 (0.71 to 1.12)" "0.321" 
## 4  "SBP"                "0.99 (0.99 to 1.00)" "0.150" 
## 5  "Income"             ""                    "0.212" 
## 6  "Income:0"           "1.00"                ""      
## 7  "Income:1"           "0.81 (0.60 to 1.11)" "0.189" 
## 8  "Income:2"           "1.16 (0.87 to 1.53)" "0.310" 
## 9  "Education"          ""                    "<0.001"
## 10 "Education:0"        "1.00"                ""      
## 11 "Education:1"        "0.75 (0.61 to 0.93)" "0.007" 
## 12 "Education:2"        "0.39 (0.30 to 0.50)" "<0.001"
##    Adj-HR (95% CI)#      p-value 
## 1  ""                    "0.785" 
## 2  "1.00"                ""      
## 3  "0.97 (0.75 to 1.24)" "0.785" 
## 4  "0.99 (0.99 to 1.00)" "0.191" 
## 5  ""                    "0.198" 
## 6  "1.00"                ""      
## 7  "0.77 (0.56 to 1.07)" "0.121" 
## 8  "1.11 (0.83 to 1.49)" "0.465" 
## 9  ""                    "<0.001"
## 10 "1.00"                ""      
## 11 "0.72 (0.58 to 0.89)" "0.003" 
## 12 "0.38 (0.29 to 0.50)" "<0.001"

資料分析總結

本次課程中同學學習到的線性迴歸、邏輯斯迴歸或Cox比例風險模型是醫學期刊中非常常出現的幾個統計方法。

– 結合之前的兩節課，大部分期刊中常見的統計方法都已經提到了，儘管你可能不知道是怎麼算的，但一定要熟悉結果的解釋。

– 你應該已經學會這些東西了：

描述性統計
單變項統計(類別-類別、類別-連續)
相關性(類別-類別、類別-連續、連續-連續)
多變項統計(GLM、Cox model)

資料分析系列課程中間所做的練習題，其最終的應用可到第一節課提到的TableMaker，現在還覺得這些功能高不可攀嗎?

資料分析3